ISYE6501 Notes

Week 1

Module 1 - Intro & What is Modeling

What is modeling?

  1. Describe a real-life situation mathematically
  2. Analyze the math
  3. Turn mathematical answer back into real-life situation

Can mean 3 things:

  1. General (regression)
  2. General with variables (regression using size, weight, distance)
  3. Specific estimates (regression using bias + coefficients * variables)

Module 2 - Classification

Error & cost:

  • Minimizing errors: Classification lines try to divide data points with the max amount of distance to account for small data variations that might lead to classification errors.
  • Should also consider the relative weight of errors, between FPs and FNs. The more costly the error, the more distance we want to shift the line away from that side. Reflects "risk tolerance". Relevant in medical analytics.

Features:

  • If you see a straight horizontal/vertical line between 2 features, it means that the feature parallel to the line is not useful.

Data Definitions

Tables

  • Every row is a "data point"
  • Each column of a data point contains a "feature", "attribute", "covariate", etc., ie. the x values.
  • Response/outcome is the "answer" of each data point, ie. the y value.

Structured Data

  • data stored in a structured way, either quantitatively (age) or categorically (gender, hair color)

Unstructured Data

  • Data not easily described and stored.
  • Example: Text

Common types of structured data

  1. Quantitative - numbers with meaning
  2. Categorical
    • Numbers without meaning (ex. zipcodes)
    • Non-numeric: Hair color
  3. Binary data (subset of categorical data)
    • Can only take one of two values
    • Can sometimes be treated as a quantitative measure
  4. Unrelated data - no relationship between data points
  5. Time series data - same data recorded over time (stock, height of a child)
    • Often recorded at same interval (but not necessarily)

Support Vector Machines

Applied to credit loan problem:

m = number of data points
n = number of attributes
x_ij = jth attribute of ith data point
  x_i1 = credit score of person i
  x_i2 = income of person i
y_i = response for data point i
  y_i = 1 if data point i is blue, -1 if data point i is red
line = (a_1 * x_1) + (a_2 * x_2) + ... + (a_n * x_n) + a_0 = 0

Notes:

  • Need to scale the data. If not, certain coefficients will be far more sensitive to change than others.
  • If a certain coefficient (feature) is near-zero, it is probably not relevant for classification.
  • Works the same in more dimensions (ie. more attributes)
  • Classifier doesn't have to be a straight line. Can use kernel methods to make non-linear classifiers.
  • Logistic regression is better for getting probability of classification (e.g. 37% likely to default).
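The classifier line above can be evaluated directly: a point's class is the sign of the weighted sum. A minimal Python sketch (the coefficients here are made up for illustration; in practice they come from fitting the SVM):

```python
# Sketch of evaluating a fitted linear classifier: sign of
# a_1*x_1 + ... + a_n*x_n + a_0, per the equation above.
# Coefficients below are hypothetical, not from a real fit.

def classify(point, coeffs, a0):
    """Return 1 ("blue") if the point is on the positive side of the line, else -1 ("red")."""
    value = sum(a * x for a, x in zip(coeffs, point)) + a0
    return 1 if value >= 0 else -1

# Made-up line x_1 + x_2 - 1 = 0
coeffs = [1.0, 1.0]
a0 = -1.0
print(classify([2.0, 2.0], coeffs, a0))   # 1  (above the line)
print(classify([0.0, 0.0], coeffs, a0))   # -1 (below the line)
```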

Scaling data

Scaling linearly:

$$ x' = \frac{x - x_{min}}{x_{max} - x_{min}} $$

Scaling to a normal distribution:

$$ x' = \frac{x - \mu}{\sigma} $$

Which method to use? Depends:

  • Use scaling for data in a bounded range.
    • neural networks
    • optimization models that need bounded data
    • batting average (0.000-1.000)
    • RGB color intensities (0-255)
    • SAT scores (200-800)
  • Use standardization for PCA / Clustering.
  • Sometimes it won't be clear. Try both!
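The two options above can be sketched in a few lines of Python (the course uses R, but the formulas are the same; the SAT-score values are illustrative):

```python
from statistics import mean, pstdev

def scale_linear(xs):
    """Min-max scaling to [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    """Standardization to mean 0, std 1: (x - mean) / std."""
    mu, sd = mean(xs), pstdev(xs)
    return [(x - mu) / sd for x in xs]

scores = [200, 500, 800]          # e.g. SAT scores, bounded 200-800
print(scale_linear(scores))       # [0.0, 0.5, 1.0]
print(standardize(scores))        # mean 0, values in units of standard deviations
```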

Handwritten notes for SVM


KNN Algorithm

Idea:

  1. Find the class of a new data point
  2. Pick the k-closest points ('nearest neighbors') to the new one
  3. The new point's class is the most common among the k-neighbors

Things to keep in mind:

  • Can be used for multi-class classification tasks.
  • "K" is a parameter that can be tweaked.
  • There is more than one way to calculate distance metrics.
  • Some attributes can be weighted by importance.
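The three steps above can be sketched in a few lines of Python (toy data, Euclidean distance, unweighted attributes):

```python
import math
from collections import Counter

def knn_classify(new_point, data, k=3):
    """data: list of (point, label) pairs. Vote among the k nearest by Euclidean distance."""
    nearest = sorted(data, key=lambda d: math.dist(new_point, d[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]   # most common class among the k neighbors

# Tiny illustrative data set (made up)
data = [((1, 1), "red"), ((1, 2), "red"), ((2, 1), "red"),
        ((8, 8), "blue"), ((8, 9), "blue"), ((9, 8), "blue")]
print(knn_classify((2, 2), data, k=3))   # red
print(knn_classify((8, 7), data, k=3))   # blue
```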


Classification Summary

  • Divides data points into groups based on similarity
  • Graphical intuition
  • Basic solution methods
    • SVM
    • KNN
  • Data terminology
  • Validation
  • Distance metrics
  • Confusion matrices

Week 2

Module 3 - Validation

Data has 2 types of patterns:

  1. real effects - real relationship between attributes and response
  2. random effects - random patterns that look like real effects

"Fitting" matches both:

  • real effects: same in all data sets
  • random effects: different in all data sets
  • BUT: only real effects are duplicated in other data.

How to measure a model's performance:

  • larger set of data to fit the model
  • smaller set to measure the model's effectiveness

Train/Validate/Test to choose best model

What if we want to compare 5 runs of SVM and KNN?


Problems:

  • Observed performance = real quality + random effects. High-performing models are more likely to have above-average random effects.
  • Just choosing best performing model is too optimistic.

Solution:

  • Use both validation and test data set.
  • Training: build the models; Validation: pick the best one; Test: estimate its performance


Splitting Data - Random/Rotation

How much data to use?

  • Typically 70-90% training, 10-30% test.
  • BUT to compare models: 50-70% train, split the rest equally between validation and test.

How to split data?

  1. Random: Randomly sample split (60:20:20) without replacement.
  2. Rotation: Take turns selecting points (e.g. a 5-data-point rotation sequence). Benefit: separates the data evenly. Con: may introduce bias if the rotation matches a pattern in the data (e.g. workdays)
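Both splitting schemes are easy to sketch in Python (100 stand-in data points, 60:20:20 split):

```python
import random

data = list(range(100))           # stand-in data set

# 1. Random split (60:20:20) without replacement
shuffled = random.sample(data, len(data))
train, val, test = shuffled[:60], shuffled[60:80], shuffled[80:]

# 2. Rotation split with a 5-point sequence: 3 train, 1 validation, 1 test
rot_train = [x for i, x in enumerate(data) if i % 5 in (0, 1, 2)]
rot_val   = [x for i, x in enumerate(data) if i % 5 == 3]
rot_test  = [x for i, x in enumerate(data) if i % 5 == 4]
print(len(rot_train), len(rot_val), len(rot_test))   # 60 20 20
```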

Splitting Data - k-Fold Cross Validation

Idea: For each of the k parts - train the model on all the other parts and evaluate it on the one remaining.

  • Pros: Better use of the data (trains across all available data), better estimate of model quality and more effective way to choose a model.
  • For 4-fold, use 3 parts (ABC) for training and 1 (D) for validation, then rotate so that each part serves as validation once. Essentially, each part is used k-1 times for training and once for validation.
  • Finally, average the k evaluations to estimate the model's quality.
  • Most common k is 10.
  • Important: Don't use the resulting models from k-fold cross validation, either by choosing one of the trained models or by averaging the coefficients across the k splits. Train the model again using all the data.
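A minimal Python sketch of generating the k-fold train/validation index splits (rotation-style fold assignment; the model fitting itself is omitted):

```python
def kfold_indices(n, k):
    """Yield (train_idx, val_idx) pairs; each fold is the validation set exactly once."""
    folds = [list(range(i, n, k)) for i in range(k)]   # rotation assignment to k folds
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val

for train, val in kfold_indices(8, 4):
    print(sorted(val))   # every data point lands in validation exactly once
```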


Module 4 - Clustering

Definition: Grouping data points

Use:

  • Targeted marketing / market segmentation (market on size / price / versatility / aesthetics)
  • Personalized medicine
  • Locating facilities
  • Image analysis (face recognition, captcha)
  • Data investigation

Distance Norms

Euclidean (straight-line) distance
  • 2-dimension $$ Distance = \sqrt{(x_1-y_1)^2+(x_2-y_2)^2} $$
Rectilinear distance
  • 2-dimension
  • Used for calculating distance in a grid.
  • Also called "Manhattan distance"
$$ Distance = |x_1 - y_1| + |x_2 - y_2| $$
p-norm Distance (Minkowski Distance)
  • n-dimension
  • Sum over all n dimensions
$$ Distance = \sqrt[p]{\sum_{i=1}^{n}|x_i - y_i|^p} $$
Infinity Norms
  • Definition: "The infinity norm simply measures how large the vector is by the magnitude of its largest entry." (reference)
  • It is essentially p-norm distance where p is set to infinity.
$$ Distance = \lim_{p \to \infty}\sqrt[p]{\sum_{i=1}^{n}|x_i - y_i|^{p}} = \max_i|x_i - y_i| $$

Why use infinity norms?

  • Robotics example: Automated storage and retrieval system. Time it takes to store/retrieve depends on moving to the aisle (horizontal) and arm height (vertical).
  • A "one-norm" would be the robot moving to the aisle, then lifting its arm to the correct height.
  • An "infinity-norm" would be the robot moving and stretching its arm at the same time; the total time is dictated by whichever motion takes longer to complete.
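The three norms can be computed with a few lines of Python (the points are illustrative):

```python
import math

def p_norm(x, y, p):
    """Minkowski distance: p=1 is rectilinear, p=2 is Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def infinity_norm(x, y):
    """Limit as p -> infinity: just the single largest coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (0, 0), (3, 4)
print(p_norm(x, y, 1))      # 7.0 (rectilinear / Manhattan)
print(p_norm(x, y, 2))      # 5.0 (Euclidean)
print(infinity_norm(x, y))  # 4
```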

K-Means Algorithm

How it works:

  1. Pick k cluster centers within range of data
  2. Assign each data point to nearest cluster center
  3. These initial centers are not the actual "centers" of the clusters. To find them, recalculate each cluster's centroid; the centroids become the new cluster centers.
  4. After the centroids are set, some data points might not be in the right cluster anymore (determined by closeness to each centroid). Keep repeating steps 2 and 3 until no data point changes clusters.
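The steps above can be sketched in Python for 1-D data (toy numbers and a fixed starting guess; real k-means works in any number of dimensions with randomized initialization):

```python
from statistics import mean

def kmeans_1d(points, centers, iters=20):
    """Lloyd's algorithm on 1-D data: assign to nearest center, then recompute centroids."""
    for _ in range(iters):
        clusters = {c: [] for c in centers}
        for p in points:                                   # step 2: assign to nearest center
            nearest = min(centers, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        new_centers = [mean(v) for v in clusters.values() if v]   # step 3: recompute centroids
        if sorted(new_centers) == sorted(centers):         # step 4: stop when nothing changes
            break
        centers = new_centers
    return sorted(centers)

print(kmeans_1d([1, 2, 3, 10, 11, 12], centers=[0.0, 5.0]))   # [2, 11]
```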

Characteristics:

  • k-means is a heuristic: fast and good, but not guaranteed to find the absolute best solution.
  • It is an expectation-maximization (EM) style algorithm: the "expectation" step assigns each point to its nearest cluster center; the "maximization" step moves each center to maximize the negative of the total distance (i.e., minimize total distance).
K-means in practice
  1. Outliers: k-means assigns the outlier to whichever cluster center is closest. The easiest fix is to remove the outliers, but best practice is to understand why they are there.
  2. It's fast: Take advantage of its runtime by running several iterations.
    • For each run, choose a different initial cluster center (randomized initialization), then choose the best solution found.
    • Test different values of k as well.
  3. Understanding k for optimization.
    • Make sure to "fit the situation you're analyzing". Setting k equal to the number of data points minimizes total distance, but does that actually make sense for the task?
    • Solution: Plot k vs total distance to find the "elbow" of the curve. At a certain number, the benefit of adding another k becomes negligible.
K-means for predictive clustering

Classification task: Given a new data point, determine which cluster it belongs to. To do this, simply assign it to whichever cluster centroid it is closest to.

Another classification task: What range of possible data points would we assign to each cluster?

Image of cluster regions, aka a "Voronoi diagram"

Classification / Clustering - Differences

  • Difference is what we know about the data points
  • In classification, we know the correct class of each data point. This is called supervised learning.
  • In clustering, we don't know the correct class of each data point. This is called unsupervised learning.


Week 3

Module 5 - Basic Data Preparation

Data Preparation: Quantitative Examples

  • credit scores
  • average daily temperature
  • stock price

Things to watch out for before building models:

  • Scale of the data: Can throw off models. Beware of possible outliers.
  • Extraneous information: Complicates the model, harder to correctly interpret the solution.

Outlier Detection

Definition: Data point that's very different from the rest.

Types of outliers
  • point outliers: values far from the rest of the data.
    • Example: Outlier point in a cluster scatterplot
  • contextual outlier: value isn't far from the rest overall, but is far from points nearby in time.
    • Example: Time series data, low temperature at an unexpected time.
  • collective outlier: Something is missing in a range of points, but can't tell exactly where.
    • Example: Missing heartbeat in a time series chart of ECG data.
How to detect outliers
1. Box-and-whisker plot
  • helps find outliers (for 1D data)
  • box is 25/75th percentile of values
  • middle of the box is the median
  • "whiskers" expanding outside of box is the reasonable range of values expected of the data (e.g. 10/90th or 5/95th percentile)
  • Data points outside of the whisker could be considered outlier data.


2. Modeling error
  • Idea: Fit model, then find points of high error
  • Example: Fit exponential smoothing model to temperature/time series data.
  • Points with very large error might be outliers. The model expects a smooth curve, but the actual data is far from it.


Dealing with Outliers

Depends on the data:

  • Bad data (failed sensor, bad experiment, bad input)... or maybe real data?
  • Need to root cause - possible avenues: find data source, understand how it was compiled.
  • With large data, outliers are expected.
    • In normally-distributed data, about 5% of points fall outside 2 standard deviations.
    • With 1,000,000 data points, roughly 2,700 are expected outside 3 standard deviations.
    • Removing them can make the model too optimistic; real data includes noise.

Actions:

  • If it's really bad data, remove it and use data imputation to fill the gap
  • If real/correct data, think whether the data you're modeling could realistically have outlier data points (e.g. force majeure incidents causing delays/cancellations).
  • Modeling outliers: Using logistic regression model, estimate probability of outliers happening under different conditions. Then, use a second model to estimate length of delivery under normal conditions, using data without outliers.

Module 6 - Change Detection

Definition: Determining whether something has changed.

Why:

  • Hypothesis tests often not sufficient for change detection, because they are slow to detect change.

Cumulative sum (CUSUM) approach

Definition: Detect increase/decrease or both by the cumulative sum.

How:

  • The basic idea is to keep a running total of how far each observation deviates from the expected mean.
  • The running total never drops below zero; whenever it would, it resets to zero. This allows change to be detected quickly.
  • The running total ($S_t$) is then compared against a threshold $T$; exceeding it signals a change.
  • $C$ is a buffer that pulls the running total down, to prevent too many false positives.

Terms:

  • $X_t$ = observed value at time $t$
  • $\mu$ = mean of x, if no change
  • $C$ = buffer value
  • $T$ = threshold value
  • $X_t - \mu$ = how much above expected the observation is at time $t$
  • $\mu - X_t$ = how much below expected the observation is at time $t$
Formula and Example:


Formula to detect an increase (change detected when $S_t \ge T$):

$$ S_t = \max\left\{0,\; S_{t-1} + (X_t-\mu-C) \right\} $$

Formula to detect a decrease (also $S_t \ge T$); $\mu$ and $X_t$ are flipped:

$$ S_t = \max\left\{0,\; S_{t-1} + (\mu-X_t-C) \right\} $$

Note: Both can be used together to create a control chart, where $S_t$ is plotted over time; if it ever crosses the threshold line, CUSUM has detected a change.
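The increase-detection formula as a minimal Python sketch (the data and the $\mu$, $C$, $T$ values are made up for illustration):

```python
def cusum_increase(xs, mu, C, T):
    """S_t = max(0, S_{t-1} + (x_t - mu - C)); report the first t where S_t >= T."""
    S = 0.0
    for t, x in enumerate(xs):
        S = max(0.0, S + (x - mu - C))
        if S >= T:
            return t          # change detected at this time step
    return None               # no change detected

# Mean holds at ~10, then shifts up starting at index 5 (made-up numbers)
xs = [10, 9, 11, 10, 10, 14, 15, 14, 15, 16]
print(cusum_increase(xs, mu=10, C=1, T=5))   # 6
```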

Week 4

Module 7 - Time Series Models

Exponential Smoothing

Time series data will have a degree of randomness. Exponential smoothing accounts for this by smoothing the curve.

Example:

  • $S_t$ - expected baseline response at time period $t$
  • $X_t$ - observed response at $t$
  • What is real blood pressure over time without variation?
  • Different blood pressure, increase in baseline, or random event?
  • Ways to answer:
    • $S_t = X_t$ - observed blood pressure is real indicator
    • $S_t = S_{t-1}$ - today's baseline is the same as yesterday's baseline

Exponential Smoothing Model

  • Sometimes called single, double, triple depending on how many aspects are considered (seasonality, trend)
  • Triple exponential smoothing is also called Winters' method or Holt-Winters.
Formula
$$ S_t = \alpha x_t + (1 - \alpha)S_{t-1}, \qquad 0 < \alpha < 1 $$
  • $\alpha$ trades off between trusting $x_t$ when $\alpha$ is large, and trusting $S_{t-1}$ when $\alpha$ is small.
  • $\alpha$ -> 0: A lot of randomness in the system.
    • Trust previous estimate $S_{t-1}$
  • $\alpha$ -> 1: Not much randomness in the system.
    • Trust what you see $x_t$.
  • How to start? Set the initial condition $S_1 = x_1$
  • Does not deal with trends or cyclical variations - more on that later.
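The basic model as a minimal Python sketch, showing how $\alpha$ trades off trust in the observation vs. the previous baseline (toy data):

```python
def exp_smooth(xs, alpha):
    """S_1 = x_1; S_t = alpha*x_t + (1 - alpha)*S_{t-1}."""
    S = [xs[0]]                              # initial condition S_1 = x_1
    for x in xs[1:]:
        S.append(alpha * x + (1 - alpha) * S[-1])
    return S

xs = [10, 20, 10, 20, 10]
print(exp_smooth(xs, alpha=0.2))   # small alpha: spikes are damped toward the baseline
print(exp_smooth(xs, alpha=0.9))   # large alpha: closely tracks the observations
```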

Time series complexities

  • Trends (increase/decrease)
  • Cyclical patterns (temp/sales/biometric cycles)

Trends

Exponential Smoothing, but with $T_t$ (trend at time period $t$): $$S_t = \alpha x_t + (1 - \alpha)(S_{t-1} + T_{t-1})$$

Trend at time $t$ is based on the change between consecutive baseline estimates $S_t$ and $S_{t-1}$, smoothed with a constant $\beta$. $$T_t = \beta (S_t - S_{t-1}) + (1-\beta)T_{t-1}$$

  • Initial condition is $T_1 = 0$

Cyclic

Two ways to calculate:

  1. Like trend - additive component of formula, or
  2. Seasonalities: multiplicative way where:
    • $L$: Length of a cycle
      • e.g. $L=24$ for hourly reading of biometrics a day
    • $C_t$: Multiplicative seasonality factor for time $t$.
      • inflates/deflates the observation

Baseline formula (including trend + seasonality)

$$ S_t = \frac{\alpha x_t}{C_{t-L}} + (1 - \alpha)(S_{t-1} + T_{t-1}) $$

Update the seasonal (cyclic) factor in a similar way as trends:

$$C_t = \gamma(\frac{x_t}{S_t}) + (1 - \gamma)C_{t-L}$$
  • No initial cyclic effect: $C_1$, ..., $C_L$ = 1

Example: Sales trends

  • If $C$ is 1.1 on Sunday, then sales were 10% higher just because it was Sunday
  • Of 550 sold on Sunday, 500 is baseline, 50 is 10% extra

Starting Conditions

For trend:

  • $T_1 = 0$: No initial trend

For multiplicative seasonality

  • Multiplying by 1: shows no initial cyclic effect
  • First $L$ values of $C$ set to 1.

Exponential Smoothing: What the Name Means

"Smoothing"


  • The equation smooths high and low spikes using $\alpha$.
    • If we observe a high value $x_t$, the baseline estimate $S_t$ is not as high: the observation is only weighted by $\alpha < 1$ and is pulled down from the high point by the previous baseline $S_{t-1}$.
    • If we observe a low value, the $(1-\alpha)S_{t-1}$ term pulls the estimate up from the low observed value.


"Exponential"

Expanding the recursion (plugging each earlier estimate into the next) shows where the name comes from:

$$ S_t = \alpha x_t + \alpha(1-\alpha)x_{t-1} + \alpha(1-\alpha)^2 x_{t-2} + \ldots $$

  • Each older observation is weighted by $1-\alpha$ raised to an increasing exponent.
  • Significance: every past observation contributes to the current baseline estimate $S_t$.
  • More recent observations are weighted more, i.e. are more important.

Applications in Forecasting

Given basic exponential smoothing equation

$$S_t = \alpha x_t + (1-\alpha)S_{t-1}$$

We want to make a prediction for time period $t+1$. Since $x_{t+1}$ is unknown, replace it with our best estimate, $S_t$.

Using $S_t$, the forecast for time period $t+1$ is

$$F_{t+1} = \alpha S_t + (1-\alpha)S_t$$

hence, our estimate is the same as our latest baseline estimate

$$F_{t+1} = S_t$$

Factoring in trend/cycle

The above equation can be extrapolated to trend/cycle calculations.

Best estimate of trend is the most current trend estimate:

$$F_{t+1} = S_t + T_t$$

Same for cycle (multiplicative seasonality)

$$F_{t+1} = (S_t + T_t) C_{(t+1)-L}$$

More generally, $F_{t+k} = (S_t + kT_t)C_{t+k-L}$ for $k = 1, 2, \ldots$

AutoRegressive Integrated Moving Average (ARIMA)

3 key parts

1. Differences

  • Exponential smoothing assumes the data is stationary, but in reality it often is not.
  • However, the difference might be stationary.
  • For example:

    • First-order difference $D_{(1)}$: difference of consecutive observations

    $$D_{(1)t} = x_t - x_{t-1}$$

    • Second-order difference $D_{(2)}$: differences of the differences

    $$D_{(2)t} = (x_t - x_{t-1}) - (x_{t-1} - x_{t-2})$$

    • Third-order difference $D_{(3)}$: differences of the differences of the differences

    $$D_{(3)t} = [(x_t - x_{t-1}) - (x_{t-1} - x_{t-2})] - [(x_{t-1} - x_{t-2}) - (x_{t-2} - x_{t-3})]$$

    • Taking differences $d$ times gives the $d$-th order difference
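First- and second-order differences in a few lines of Python (toy quadratic series, chosen so the second differences come out constant, i.e. stationary):

```python
def diff(xs):
    """First-order differences: D_t = x_t - x_{t-1}."""
    return [b - a for a, b in zip(xs, xs[1:])]

xs = [1, 4, 9, 16, 25]           # quadratic trend, not stationary
d1 = diff(xs)                    # [3, 5, 7, 9] -- still trending
d2 = diff(d1)                    # [2, 2, 2]    -- constant: second differences are stationary
print(d1, d2)
```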

2. Autoregression

Definition: Predicting the current value based on previous time periods' values.

Exponential smoothing, viewed as autoregression:

  • an order-$\infty$ autoregressive model.
  • uses data all the way back.

Order-p autoregressive model:

  • Go back only p time periods.

"ARIMA" combines autoregression and differencing

  • autoregression on the differences
  • use p time periods of previous observations to predict d-th order differences.

3. Moving Average

  • Uses previous errors $\epsilon_t$ as predictors $$\epsilon_t = \hat{x}_t - x_t$$
  • Apply order-q moving average (go back q time periods) $$\epsilon_{t-1}, \epsilon_{t-2}, \ldots , \epsilon_{t-q}$$

ARIMA (p,d,q) model

$$ D_{(d)t} = \mu + \sum_{i=1}^{p}\alpha_i D_{(d)t-i} - \sum_{i=1}^{q}\theta_i(\hat{x}_{t-i} - x_{t-i}) $$

Choose:

  • p-th order autoregression
  • d-th order difference
  • q-th order moving average
  • Once chosen, statistical software can fit the model's coefficients through optimization.

Other flavors of ARIMA

  • Add seasonal values of p, d, q
  • Set specific values of p, d, q to get certain qualities
    • ARIMA(0,0,0) = white noise
    • ARIMA(0,1,0) = random walk
  • Can generalize a lot of simpler models:
    • ARIMA(p,0,0) = AR (autoregressive) model
    • ARIMA(0,0,q) = MA (moving average) model
    • ARIMA(0,1,1) = basic exponential smoothing model
  • Short-term forecasting
    • Usually better than exponential smoothing if the data is more stable, with fewer peaks/valleys/outliers.
  • Need ~40 past data points for ARIMA to work well.

Generalized Autoregressive Conditional Heteroskedasticity (GARCH)

Definition: Estimate or forecast the variance of something, given a time-series data.

Motivation:

  • Estimate the amount of error.
  • Example: Forecast demand for pickup trucks
    • Variance tells how much forecast might be higher/lower than true value
  • Example: Traditional portfolio optimization model.
    • Balances the expected return of a portfolio with the amount of volatility. (Riskier: higher expected return, and vice versa.)
    • Variance is a proxy for the amount of volatility/risk.

Mathematical Model:

$$ \sigma_t^2 = \omega + \sum_{i=1}^{p}\beta_i\sigma_{t-i}^2 + \sum_{i=1}^{q} \gamma_i\epsilon_{t-i}^2 $$
  • Equation is similar to ARIMA, with 2 differences
    1. Uses variance/squared errors, not observations/linear errors.
    2. Uses raw variances, not differences of variances.
  • Stat software can fit the model, given p and q (don't need d since GARCH doesn't use differences)

SUMMARY

  1. Exponential smoothing
  2. ARIMA - generalized version of exponential smoothing
  3. GARCH - ARIMA-like model for analyzing variance

Week 5

Basic Regression

What it explains:

  • Causation / Correlation
    • Value of a home run
    • Effect of economic factors on presidential election
  • Prediction
    • Height of child at adulthood
    • Future oil price
    • Housing demand in next 6 months

Simple Linear Regression (SLR)

Definition: Linear regression with one predictor

  • Looks for linear relationship between predictor and response.


Sum of Squared Error

Example:

  • $y_i$ = cars sold for data point i
  • $\hat{y}_i$ = model's prediction of cars sold
  • Data point i's prediction error: $y_i - \hat{y}_i = y_i - (a_0 + a_1 x_{i1})$

Sum of Squared Errors $$ \sum_{i=1}^n(y_i - \hat{y}_i)^2 = \sum_{i=1}^n(y_i - (a_0 + a_1 x_{i1}))^2 $$

Best-fit regression line

  • The line moves around to minimize the sum of squared errors. As it moves, the values of the coefficients $a_0$ and $a_1$ change.
  • Defined by $a_0$ and $a_1$
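For SLR, minimizing the sum of squared errors has a well-known closed form: $a_1 = \mathrm{cov}(x,y)/\mathrm{var}(x)$ and $a_0 = \bar{y} - a_1\bar{x}$. A Python sketch with exactly linear toy data:

```python
from statistics import mean

def slr_fit(xs, ys):
    """Least-squares fit: a1 = cov(x, y)/var(x), a0 = ybar - a1*xbar."""
    xbar, ybar = mean(xs), mean(ys)
    a1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
         / sum((x - xbar) ** 2 for x in xs)
    a0 = ybar - a1 * xbar
    return a0, a1

# Exactly linear toy data: y = 2x + 1
a0, a1 = slr_fit([1, 2, 3, 4], [3, 5, 7, 9])
print(a0, a1)   # 1.0 2.0
```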

Maximum likelihood (or how to measure the quality of a model's fit)

  • Likelihood: Measure the probability (density) for any parameter set.
  • Maximum likelihood: Parameters that give the highest probability.

How it works:

  • Assume that the observed data is the correct value and we have information about the variance.
  • Then for any set of params we can measure the probability (density) that the model would generate the estimates it does.
  • Whichever set of params gives the highest probability density (ie. maximum likelihood) is the best-fit set of params.

Maximum Likelihood - Example

  • Assumption: the distribution of errors is normal with mean zero and variance $\sigma^2$, independently and identically distributed.
  • Observations: $z_1$, ..., $z_n$
  • Model estimates: $y_1$, ..., $y_n$
  • Maximum-likelihood estimate: the set of parameters that minimizes the sum of squared errors


MLE in Linear Regression

$$ \hat{y}_i = a_0 + a_1 x_{i1} + \ldots + a_m x_{im} $$

  • $x_{ij}$ is the $j$th observed predictor value of data point $i$.
  • $a_0$ through $a_m$ are the params we're trying to fit.

Maximum Likelihood Fitting

  • Simple example: regression with independent normally distributed errors
  • But can get complex with:
    • different estimation formulas
    • different assumptions about the error
  • Stat software can handle most of these complexities.

Models that Combine Maximum Likelihood with Model Complexity

Akaike Information Criterion (AIC)

  • $L^*$: Maximum likelihood value
  • $k$: number of parameters being estimated
$$ AIC = 2k - 2\ln{(L^*)} $$
  • $2k$ is the penalty term. Balances model's likelihood with its simplicity. This is done to reduce the chance of overfitting by adding too many parameters.
  • The model with the smallest AIC is preferred.

AIC applied to regression

  • $k = m+1$: number of parameters
  • Substitute the maximized linear-regression likelihood for $L^*$.


Corrected AIC ($AIC_c$)

  • AIC weakness: it works best when there are infinitely many data points (it is an asymptotic result).
  • Corrected AIC accounts for finite samples by adding a correction term.

Equation:

  • $n$: number of data points
$$ AIC_c = AIC + \frac{2k(k+1)}{n-k-1} = 2k - 2\ln{(L^*)} + \frac{2k(k+1)}{n-k-1} $$

Comparing AIC / AICc models by Relative Probability

Example:

  • Model 1: AIC = 75
  • Model 2: AIC = 80

Relative likelihood $$ e^\frac{(AIC_1 - AIC_2)}{2} $$

Applied to Models 1 & 2: $$ e^\frac{(75 - 80)}{2} = e^{-2.5} \approx 8.2\% $$

Result:

  • Model 2 is 8.2% as likely as Model 1 to be the better model.
  • Since a lower AIC is better, Model 1 is the better choice.
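The relative-likelihood calculation as code (same numbers as the example above):

```python
import math

def relative_likelihood(aic_best, aic_other):
    """exp((AIC_best - AIC_other)/2): how likely the other model is as good as the best one."""
    return math.exp((aic_best - aic_other) / 2)

print(round(relative_likelihood(75, 80), 3))   # 0.082 -> Model 2 is ~8.2% as likely
```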

Bayesian Information Criterion (BIC)

  • $L^*$: Maximum likelihood value
  • $k$: Number of parameters being estimated
  • $n$: Number of data points
$$ BIC = k \ln{(n)} - 2 \ln{(L^*)} $$

Characteristics:

  • Similar to AIC
  • BIC's penalty term $k\ln(n)$ is larger than AIC's $2k$ once $n \ge 8$.
  • So BIC encourages models with fewer parameters than AIC does.
  • Only use BIC when there are more data points than parameters

BIC Metrics - Rule of thumb

  • $\Delta BIC < 2$: slight evidence that the lower-BIC model is better
  • $2 \le \Delta BIC < 6$: somewhat likely better
  • $6 \le \Delta BIC < 10$: likely better
  • $\Delta BIC \ge 10$: very likely better

Summary of AIC / BIC

  • AIC = Frequentist, BIC = Bayesian
  • No hard-and-fast rule for using AIC or BIC or maximum likelihood.
  • Looking at all three can help you decide which is best.

Understanding Regression Coefficients

Baseball Example: Determine average number of runs a home run is worth.

  • Response: How many runs a team scored that season
  • Predictors: # of home runs, triples, doubles, singles, outs, etc.

Equation:

$$ \text{Runs Scored} = a_0 + a_1[\text{Number of HR}] + a_2[\text{Number of Triples}] + \ldots + a_m[\text{Other Predictors}] $$

Applications of LR:

  1. Descriptive Analytics
  2. Predictive Analytics
  3. ~Prescriptive Analytics~ (Not used for this)

Causation vs Correlation

Causation: One thing causes another thing.

Correlation: Two things tend to happen (or not happen) together; neither of them might cause the other.

Application in Linear Regression:

  • x: city's avg daily winter temp
  • y: hours/days spent outdoors in winter
  • x may cause y (warmer avg temp -> more time outdoors in winter)
  • y does not cause x (more time outdoors in winter does not make the average temp warmer)

How to decide if there is causation?

  • Cause is before effect.
  • Idea of causation makes sense (subjective)
  • No outside factors causing the relationship (hard to guarantee)

Transforming Data for Generalized LR Modeling


  • Problem: LR won't fit data on right.
  • Solution: Transform the data

Method 1: We can adjust the data so the fit is linear.

  • Quadratic Regression
$$ y = a_0 + a_1x_1 + a_2x_1^2 $$

Method 2: We can transform the response

  • Transform response with logarithmic function $$ \log(y) = a_0 + a_1x_1 + a_2x_2 + \ldots + a_mx_m $$

  • Box-Cox transformations (link) can be automated in statistical software.

When to Transform Data

Example: Variable interaction

  • Ex: Estimating 2-yr old's adult height.
  • Predictors:
    • Father's height
    • Mother's height
    • Product of the parents' heights ($x_1x_2$)
  • We can add this product as a new predictor $x_3$
$$ y = a_0 + a_1x_1 + a_2x_2 + a_3(x_1x_2) $$

How to Interpret Output

1: P-Values

  • Estimates the probability that coefficient might be zero (ie. type of hypothesis testing).
  • Rule of thumb: If coefficient's p-value is > 0.05, remove the attribute from the model.
  • Other thresholds besides 0.05 can be used:
    • Higher: More factors included
      • Con: May include irrelevant factors
    • Lower: Fewer factors included
      • Con: May leave out relevant factors.

Warnings About P-Value

  • With large # of data, p-values get small even when attributes are not related to response.
  • P-values are only probabilities, even when they seem meaningful.
    • Example: 100 attributes, each with a 0.02 p-value
    • Each has a 2% chance of not being significant.
    • So expect that ~2 of them are really irrelevant.
  • Some things to think about (from office hours):
    • Does the effect size seem large enough to support such a small p-value, or are the p-values small simply because of the sample size?
    • What happens to a given attribute's p-value if other attributes are removed from the regression model?
    • What does the known relationship between response and covariate tell you about this p-value? Is it likely a spurious-but-strong correlation, or a genuine effect?

2: Confidence Interval (CI)

  • Where the coefficient probably lies
  • How close that is to zero.

3: T-Statistic

  • The coefficient divided by its standard error.
  • Related to p-value.

4: Coefficient

  • Check whether the coefficient makes any practical difference, even with a low p-value.
  • Example: Estimating household income with age as an attribute.
    • If the coefficient of age is 1 (one dollar per year of age), the attribute isn't important even with a low p-value.

(A bit more explanation on contextualizing the coefficient of 1.0 in this context explained in Piazza):

Notice that in that lecture (M5L6, timestamp ~3:10) he's talking about the value of the coefficient relative to the scale of the attribute in question and the response value. A coefficient of 1.0 on the age attribute isn't that significant because the coefficient is in units of USD/year of age, and the response is a household income. We can make a decent guess what values the age variable will take (probably adult, probably younger than retirement age), and those values multiplied by 1.0 (on the order of tens of dollars) aren't going to make much difference in a US household income (on the order of tens of thousands of dollars). That phrase "the coefficient multiplied by the attribute value" is important.

So this isn't a rule of thumb about the value 1.0. This is advice to keep the scale of your attributes and response in mind when interpreting coefficients.

5: R-Squared Value

  • Estimate of how much variability your model accounts for.
  • Example:
    • R-square = 59%
    • This means the model accounts for ~59% of the variability in the data.
    • Remaining 41% is randomness or other factors.

Adjusted R-squared: Adjusts for the number of attributes used.

Warnings about R-squared:

  • Some things are not easily modeled, especially when trying to model real-life systems.
  • In these cases, even an R-squared of 0.3-0.4 is quite good.

From Office Hours: W5 HW Review

Key Assumptions of Linear Regression

  1. Linearity - there needs to be a linear relationship between X and Y
  2. Independence - (no autocorrelation) each observation is independent of the others. Time series violates this.
  3. Normality - residuals (difference between predicted and actual values) need to be normally distributed. Don't want to see a pattern in the residuals.
  4. Constant variance (homoscedasticity) - residuals need to have constant variance.
  • P-value (in linear regression) is the probability that the coefficient's value is actually zero. A higher p-value means a greater probability.
  • Multiple R-squared: percentage of variance explained by your model (0-1)
  • Adjusted R-squared: accounts for the number of predictors used. More predictors lowers it, reducing the bias from using too many predictors.
  • A big difference between multiple and adjusted R-squared means there are probably too many predictors.

How to calculate R-squared values directly from cross validation

# R-squared = 1 - SSres/SStot

# total sum of squared differences between the data and its mean
SStot <- sum((data$Crime - mean(data$Crime))^2)

# for model, model2, and cross-validation, calculate SSres
SSres_model <- sum(model$residuals^2) # model 1
SSres_model2 <- sum(model2$residuals^2)
SSres_c <- attr(c, "ms")*nrow(data) # MSE times number of data points gives sum of squared errors

# Calculate R-squared
1 - SSres_model/SStot # initial model with insignificant predictors
# 0.803

1 - SSres_model2/SStot # model2 without insignificant predictors (based on p-value)
# 0.766

1 - SSres_c/SStot # cross-validated
# 0.638

# This shows that including the insignificant factors overfits
# compared to removing them, and even the fitted model is probably overfitted.
# This is not surprising, since we only have 47 data points and 15 predictors (~3:1 ratio).
# Good to have a 10:1 ratio or more.

HW 6 Preview - PCA

Q0.1 - Using the crime dataset, apply PCA and then create a regression model using the first few principal components.

  • Specify the new model in terms of the original variables (not the principal components)
  • Compare the quality of the PCA model to that of your solution to Q8.2 (lin reg with no PCA)
  • You can use the R function prcomp for PCA (scale the data first: scale. = TRUE)
  • Don't forget to unscale the coefficients to make a prediction for the new city (ie. do the scaling calculation in reverse).

  • Eigenvalues correspond to the amount of variance each principal component explains.
  • Eigenvectors correspond to the directions of the new axes.
    • Can find them through eigendecomposition, or SVD of the data covariance matrix.
    • Data is then projected onto these axes (matrix multiplication) to get the new features.
    • prcomp does all of this under the hood.
    • Still need to scale.
  • PCA will return n principal components. Usually only want to use the first several (often just a few); using all of them doesn't make sense because you're not taking advantage of the reduced dimensionality.
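A minimal R sketch of this workflow, using a synthetic stand-in for the crime data (all variable names here are illustrative):

```r
# Synthetic stand-in: 47 rows, 5 predictors, response "Crime"
set.seed(1)
dat <- data.frame(matrix(rnorm(47 * 5), ncol = 5))
dat$Crime <- rowSums(dat[, 1:3]) + rnorm(47)

pca <- prcomp(dat[, 1:5], scale. = TRUE)   # scale before PCA
k <- 3                                      # keep first k principal components
model_pca <- lm(dat$Crime ~ pca$x[, 1:k])

# Convert PC coefficients back to the original (unscaled) variables
b        <- coef(model_pca)[-1]
a_scaled <- pca$rotation[, 1:k] %*% b      # coefficients on scaled x
a_orig   <- a_scaled / pca$scale           # undo the scaling
b0_orig  <- coef(model_pca)[1] - sum(a_scaled * pca$center / pca$scale)

# Predictions from the unscaled form match the PC model's fitted values
pred <- b0_orig + as.matrix(dat[, 1:5]) %*% a_orig
```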

How to decide which PC is important

  • prcomp will rank them for us.
  • 1st will have the highest eigenvalue, and so forth.
  • You can plot cumulative explained variance alongside individual explained variance (see plot)

img

Week 6 - Advanced Data Preparation

Box-Cox Transformations

Why: Normality assumption

  • Some models assume data is normally distributed. Bias occurs when this assumption is wrong.
  • Left chart is normal, right chart shows heteroscedasticity (unequal variance)
    • (If applying regression model to right chart) Higher variance at upper end can make those estimation errors larger, and make model fit those points better than others.
  • We can account for this via several methods.

img

Box-Cox Transformation

  • What: Transforms a response to eliminate heteroscedasticity.
  • How: Applies a power transformation (which approaches a log transformation as $\lambda \to 0$) that:
    1. stretches the smaller range to increase its variability.
    2. shrinks the larger range to reduce its variability.

Box-Cox Formula:

  • $y$: Vector of responses
  • $t(y)$: Transformed vector
  • $\lambda$: Param to adjust
  • Goal: Transform $t(y)$ to normal distribution
$$t(y) = \frac{y^\lambda - 1}{\lambda}$$

NOTE: First check whether you need the transformation (e.g. QQ plot)
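A sketch with MASS::boxcox, which profiles the log-likelihood over a grid of $\lambda$ values for a fitted linear model (the built-in trees dataset is just a convenient example):

```r
library(MASS)

fit <- lm(Volume ~ Girth, data = trees)   # built-in dataset
bc  <- boxcox(fit, plotit = FALSE)        # log-likelihood over a lambda grid
lambda <- bc$x[which.max(bc$y)]           # lambda maximizing the likelihood

# Apply the Box-Cox transform (lambda = 0 would mean using log(y) instead)
y_t <- (trees$Volume^lambda - 1) / lambda
fit_t <- lm(y_t ~ Girth, data = trees)
```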

Detrending

Why: Time series data will often have a trend (e.g. inflation, seasons) which may bias the model. Need to adjust for these trends to run a correct analysis.

  • Example: Price of gold over time, raw price versus inflation-adjusted price.
  • If fitting a regression model based on other factors with the data as-is, it will return very different responses for the same factor values depending on the decade.
  • Adjusting for inflation returns a closer value regardless of time.

When: Whenever using factor-based model (regression, SVM) to analyze time-series data*

("Factor-based model" uses a bunch of factors to make a prediction, non-factor based model example would be a model using predictors based on time and previous values)

On What:

  • Response
  • Predictors (time series data)

How:

  • Factor-by-factor
    • 1D regression: $y = a_0 + a_1x$
    • Ex. Simple linear regression of gold prices ($Price = -45,600 + 23.2 * Year$)
    • Detrended: $Actual price - (-45,600 + 23.2 * Year)$
  • Example charts: Raw (left), Inflation-adjusted (center), Detrended via linear regression (right)
    • Center and right a bit different bc linear regression accounts for different inflation rates each year.
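The factor-by-factor approach above amounts to fitting a 1-D regression on time and keeping the residuals (synthetic numbers below, loosely mimicking the gold-price example):

```r
set.seed(2)
year  <- 1980:2019
price <- -45600 + 23.2 * year + rnorm(40, sd = 5)  # synthetic trending series

trend <- lm(price ~ year)            # fit the linear trend
detrended <- price - predict(trend)  # actual minus trend = residuals
```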

img

Intro to Principal Component Analysis (PCA)

Why:

  1. We may have too many factors, difficult deciding which ones are important.
    • Ex. Endless number of features to choose from to predict stocks, how to reduce this dimension.
  2. There may be high correlation between some of the predictors. Leads to redundant predictors.

What:

  • PCA transforms data by:
    1. Removing correlations within the data.
    2. Ranking coordinates by importance.
  • Give you n principal components:
    • Concentrate on first n PCs.
    • Pro: Reduce effect of randomness.
    • Pro: Earlier PCs likely to have higher SNR (signal-to-noise ratio).
  • Plot of PCA (chart) in 2 PCs:
    • D_1 becomes new x-coordinate.
    • D_2 becomes new y-coordinate.
    • New coordinate's correlation becomes zero, or orthogonal.
    • PCA ranks PCs by amount of spread in value (e.g. D_1 has more spread than D_2, so it becomes PC_1)

img

How PCA Works (Math)

$X$: Initial matrix of data (scaled)

  • $i$: data point
  • $j$: factor
  • $X_{ij}$: $j$-th factor of data point $i$
  • Scaled such that $\frac{1}{m} \Sigma_i x_{ij} = \mu_j = 0$
    • For each factor $j$, the average value of $x_{ij}$ over all data points is shifted to be zero.

Find all of the eigenvectors of $X^TX$

  • $V$: Matrix of eigenvectors (sorted by eigenvalue)
  • $V = [V_1, V_2 \ldots ]$, where $V_j$ is the $j$-th eigenvector of $X^TX$

PCA:

  • Each principal component is a transformation of the scaled data matrix ($X$) and eigenvector matrix ($V$) (ie. $XV_n$)
  • $XV_1$ = first component, $XV_2$ = second component
  • Formula for $t_{ik}$, ie. the $k$-th new factor value for the $i$-th data point:
$$ t_{ik} = \Sigma^m_{j=1}x_{ij}v_{jk} $$
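This formula is exactly what prcomp computes: the scores $t_{ik}$ are the scaled data matrix times the eigenvector (rotation) matrix, which we can verify on a built-in dataset:

```r
pca <- prcomp(USArrests, scale. = TRUE)
X   <- scale(USArrests)               # center and scale, as in the formula

# Scores = scaled data %*% eigenvector matrix
max(abs(X %*% pca$rotation - pca$x))  # ~0
```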

How to use beyond linear transformation

  • Use kernels for nonlinear functions.
  • Idea is similar to SVM modeling.

How to interpret the model? (Regression)

Problem: How do we get the regression coefficients for the original factors instead of PCs?

Example (Regression): PCA finds new $L$ factors ${t_{ik}}$, then regression finds coefficients $b_0, b_1, \ldots, b_L$.

$$ \begin{aligned} y_i &= b_0 + \Sigma^L_{k=1} b_k t_{ik} \\ &= b_0 + \Sigma^L_{k=1} b_k [ \Sigma^m_{j=1} x_{ij} v_{jk} ] \\ &= b_0 + \Sigma^m_{j=1} x_{ij} [\Sigma^L_{k=1} b_k v_{jk}] \\ &= b_0 + \Sigma^m_{j=1} a_j x_{ij} \end{aligned} $$

Implied regression coefficient for $x_j$: $$ a_j = \Sigma^L_{k=1} b_k v_{jk} $$

  • Each original attribute’s implied regression coefficient is equal to a linear combination of the principal components’ regression coefficients.

Eigenvalues and Eigenvectors

Given $A$: Square matrix

  • If we can find a vector $v$ and constant $\lambda$ such that $Av = \lambda v$,
  • $v$ is an eigenvector of $A$
  • $\lambda$ is an eigenvalue of $A$
    • $\det(A - \lambda I) = 0$, meaning:
    • Every value of $\lambda$ for which the determinant of $A$ minus $\lambda$ times the identity matrix equals zero is an eigenvalue of $A$.
  • Opposite direction: Given $\lambda$, solve $Av = \lambda v$ to find the corresponding eigenvector $v$
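R's eigen() does both directions at once, returning eigenvalues sorted in decreasing order together with their eigenvectors:

```r
A <- matrix(c(2, 1,
              1, 2), nrow = 2, byrow = TRUE)
e <- eigen(A)

lambda <- e$values[1]     # largest eigenvalue (3 for this matrix)
v      <- e$vectors[, 1]  # its eigenvector

# Check the defining equation A v = lambda v
max(abs(A %*% v - lambda * v))  # ~0
```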

PCA: Pros and Cons

Points to consider:

  • Just because the first principal component has the most variation doesn't mean it's always the most helpful for predictive modeling. PCA depends only on the predictors, not on the response.
  • This could (on rare occasions) lead to cases where the response is mainly affected by variables with low variability, not by those with high variability.

Example: Classification

  • Left (Good): PCA gives PCs D1 and D2. D1 is helpful for differentiating between blue/red points. D2 essentially gives dividing line between the two colors.
  • Right (Bad): D1 doesn't help separate between red and blue points. D2 is actually the more helpful separator.

img

Takeaway:

  • Usually, dimensions with more variability are better differentiators, specifically because they carry more information. But this is not always true.

Week 7

Advanced Regression (Trees)

Types:

  • CART (Classification and Regression Trees)
  • Decision Tree

Regression Trees

Terminology:

  • branch: points where a splitting heuristic is applied.
  • nodes: a group of data points after the data is split.
  • internal nodes: non-terminal nodes
  • leaf nodes are the terminal nodes where there are no more branches beyond it.

img

Method:

  1. Stratify feature space: Split data points by a certain predictor (e.g. age). This is greedy recursive binary splitting (refer to ISL Ch.8).
    • Typically use several predictors to determine split (e.g. age, number of kids, household income).
  2. Create a regression model at each split, using only the data points that belong in each node. This creates multiple regression models.
  3. Prediction: For each new data point pass the data through the decision tree and then use the leaf node's coefficient to return a response.
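The rpart package implements this greedy splitting; a small sketch on the built-in trees data (the predictor names are just that dataset's):

```r
library(rpart)

# Grow a regression tree; each leaf predicts its node's average response
tree <- rpart(Volume ~ Girth + Height, data = trees,
              control = rpart.control(minsplit = 10))

# Prediction: route the new point down the tree to a leaf
predict(tree, newdata = data.frame(Girth = 12, Height = 75))
```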

How regression trees are actually calculated

  • Problem: Running a full regression model at each node is expensive, and deeper nodes have a higher likelihood of overfitting (because they contain less data).
  • Solution: Just calculate the constant term. We do this by taking the average response in the node:
$$ a_0 = \frac{\sum_{i \,\in\, \text{node}} y_i}{\text{number of data points in node}} = \text{average response in node} $$

Derivation: img

Other uses of trees The model used at each node would be different:

  1. Logistic regression models: Use fraction of node's data points with "true" response.
  2. Classification models: Use most common classification among node's data points.
  3. Decision models: At each branch determines whether or not we meet a threshold to make a decision (e.g. send marketing email).

Branching

2 things to consider:

  1. Which factor(s) should be in the branching decision?
  2. How should we split the data?

Branching Method

(Note, this is one method - there are other ways to do this)

Key Ideas:

  • Use a metric related to the model's quality.
  • Find the "best factor" to branch with.
  • Verify that the branch actually improves the model. If not, prune the branch back.

How to reject a potential branch

  • Low improvement benefit (based on threshold)
  • One side of the branch has too few data points now (e.g. each leaf must contain at least 5% of original data).

"Overfitting our model can be costly; make sure the benefit of each branch is greater than its cost"

img

Random Forests

Motivation

  • Introduce randomness
  • Generate many trees
  • Average of multiple trees results in better predictions.

Introducing randomness

We introduce randomness in 3 steps:

  • Bootstrapping: Sample n data points with replacement from the original data and build a tree on each sample.
  • Branching: Randomly choose a subset of factors, and choose the best factor from that subset at each branch.
    • A common number of factors to use is $1 + \log(n)$, where $n$ is the total number of factors.
  • No pruning.

Effect:

  • Each tree in the forest has slightly different data.
  • We end up with a lot of different trees (usually 500-1000)
  • Each tree gives us a different regression model.

Results:

  1. For regression, use the average predicted response (average).
  2. For classification, use the most common response (mode).
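A full random forest also samples factors at each branch (e.g. via the randomForest package), but the bootstrap-and-average idea alone can be sketched with rpart:

```r
library(rpart)
set.seed(3)

n <- nrow(trees)
preds <- replicate(100, {
  idx <- sample(n, n, replace = TRUE)   # bootstrap sample
  t_b <- rpart(Volume ~ Girth + Height, data = trees[idx, ])
  predict(t_b, newdata = data.frame(Girth = 12, Height = 75))
})

mean(preds)   # regression: average the trees' predictions
```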

Pros/Cons:

  • Pro: Average between trees can neutralize over-fitting. This gives better predictions.
  • Con: Harder to explain/interpret. Doesn't provide direct weights for each predictor. "Black-box" predictor.

Explainability / Interpretability

Why explainability matters:

  • Helps us understand the "why" of the model.
  • Helps stakeholders accept our ideas.
  • Sometimes a legal requirement.
  • However, less explainable models may give better results (can fit better to complex data).

Linear Regression Example

img

Takeaway: Linear regression models are pretty explainable. Each coefficients is tied to a specific predictor.

Regression Tree Example

img

Takeaway: Tree-based models are less explainable. Branching logic is often not intuitive. Random forest is even worse.

Logistic Regression

Definition: Turns regression model into a probability model.

Standard Linear Regression:

$$ y = a_0 + a_1x_1 + \ldots + a_jx_j $$

Logistic Regression: Takes linear function and puts it into an exponential.

  • $p$: probability of the event you want to observe
$$ \log{\frac{p}{1-p}} = a_0 + a_1x_1 + \ldots + a_jx_j $$$$ p = \frac{1}{1+e^{-(a_0 + a_1x_1 + \ldots + a_jx_j)}} $$

The linear function on the right-hand side can return any number from $-\infty$ to $+\infty$, but $p$ stays between 0 and 1:

  • As it approaches $-\infty$, $p \to 0$
  • As it approaches $+\infty$, $p \to 1$
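In R this is glm with family = binomial; a sketch on the built-in mtcars data, predicting the 0/1 transmission type am:

```r
# Fit the log-odds model and get probabilities back out
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
p   <- predict(fit, type = "response")   # p itself, not the log-odds

range(p)   # all probabilities lie strictly between 0 and 1
```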

Logistic Regression Curve

Example 1:

img

  • Size of each dot shows how many observations there are.

Example 2:

img

  • If there are 0/1 responses for almost every predictor value, we can visualize it like above.
  • Each bar shows the fraction of responses that are one for each predictor value
  • Shows how LR curve fits the data.

Logistic Regression vs Linear Regression

Similarities

  • Transforms input data
  • Consider interaction terms
  • Variable selection
  • Single-tree and random-forest versions of logistic regression exist.

Differences

  • Longer to compute
  • No closed-form solution
  • Understanding quality of model

Measures of model quality:

  • Linear regression used R-squared value (% of variance in the response that is explained by the model)
  • Logistic regression uses a "pseudo" R-squared value, because it's not really measuring % of variance.

Logistic Regression as Classification

Thresholding:

  • Answer "yes" if $p > N$, otherwise answer no.
  • Example: if $p \ge 0.7$, give loan.

Receiver Operating Characteristic (ROC) Curve

img

Area Under Curve (AUC) = Probability that the model estimates a random "yes" point higher than a random "no" point.

  • Higher means less random prediction (surrogate for quality).
  • Example: Loan payment
    • Joe: repaid loan, Moe: Didn't repay loan.
    • AUC explains probability that model gives Joe's data point a higher response value than Moe.
  • $AUC = 0.5$ means we are just guessing.
  • Also called the "concordance index".
  • Doesn't differentiate cost between false negatives and false positives.

Confusion Matrix

img

There are many metrics derived from confusion matrix, main ones to know are:

  • $Sensitivity = \frac{True Positive}{True Positive + False Negative}$
  • $Specificity = \frac{True Negative}{True Negative + False Positive}$
  • $Precision = \frac{True Positive}{True Positive + False Positive}$
  • $Recall = \frac{True Positive}{True Positive + False Negative}$ (identical to sensitivity)
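These metrics are easy to compute directly from predicted and actual classes; a small helper (the function and its names are my own, not from a library):

```r
# Compute the main confusion-matrix metrics for 0/1 vectors
conf_metrics <- function(actual, predicted) {
  tp <- sum(predicted == 1 & actual == 1)
  tn <- sum(predicted == 0 & actual == 0)
  fp <- sum(predicted == 1 & actual == 0)
  fn <- sum(predicted == 0 & actual == 1)
  c(sensitivity = tp / (tp + fn),   # same as recall
    specificity = tn / (tn + fp),
    precision   = tp / (tp + fp))
}

actual    <- c(1, 1, 0, 0, 1, 0)
predicted <- c(1, 0, 0, 1, 1, 0)
conf_metrics(actual, predicted)   # all three equal 2/3 on this toy example
```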

Evaluating a model's quality using Confusion Matrix

Example: Spam detection, confusion matrix

img

Idea: We can assign costs for each factor and calculate the sum.

$$ \text{Cost} = TP \cdot c_{TP} + FN \cdot c_{FN} + FP \cdot c_{FP} + TN \cdot c_{TN} $$

Example: Cost of lost productivity

  • \$0 for correct classifications.
  • \$0.04 to read spam
  • \$1 to miss a real message
  • Total cost: \$14
$$ \$14 = 490 \cdot \$0 + 10 \cdot \$1 + 100 \cdot \$0.04 + 400 \cdot \$0 $$
  • If ratio of spam changes, add a scaling factor into the equation (new value divided by old value):

img

Poisson regression

  • Used when we think the response follows a Poisson distribution.
  • Example: count arrivals at an airport security line.
    • arrival rate might be a function of time.
    • We estimate $\lambda(x)$
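In R, a Poisson response is fit with glm(family = poisson), which models $\log \lambda$ as a linear function of the predictors (the data below is the small worked example from R's own glm help page):

```r
# Counts by outcome/treatment (Dobson's example from ?glm)
counts    <- c(18, 17, 15, 20, 10, 20, 25, 13, 12)
outcome   <- gl(3, 1, 9)
treatment <- gl(3, 3)

fit <- glm(counts ~ outcome + treatment, family = poisson)
exp(coef(fit))   # multiplicative effects on the rate
```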

Regression splines

  • Spline: A function made of polynomial pieces that connect to each other.
  • Knot: Point at which each function connects to another.
  • Fit different functions to different parts of the data set.

img

Other splines:

  • Order-k regression spline: polynomials are all order k
    • Example: multi-adaptive regression splines (MARS)
    • Called "Earth" in many stat libraries.
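A sketch of an order-3 (cubic) regression spline using the base splines package, with knots at two hypothetical points:

```r
library(splines)
set.seed(4)

x <- seq(0, 10, length.out = 100)
y <- sin(x) + rnorm(100, sd = 0.2)

# Cubic pieces joined smoothly at the knots
fit <- lm(y ~ bs(x, knots = c(3, 6), degree = 3))
```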

Bayesian Regression

We start with:

  • Data, plus
  • An estimate of how the regression coefficients and the random error are distributed.

  • Example: Predict how tall a child will be as an adult based on:

    • Data: heights of the child's mother and father
    • Expert opinion: starting distribution.
      • Example: coefficients uniformly distributed between 0.8 and 1.2
    • Use Bayes' theorem to update estimate based on existing data.
    • Most helpful when there is not much data. Use small existing data to temper expert opinion.
    • If no expert opinion, choose a broad prior distribution (e.g. uniform over a large interval).

KNN Regression

  • No estimate of prediction function.
  • Plot all the data.
  • Predict response, using average response of K closest data points.
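A minimal base-R sketch of the idea (knn_reg and its arguments are my own names, not a library function):

```r
# Predict with the average response of the k nearest training points
knn_reg <- function(X, y, x_new, k = 5) {
  d <- sqrt(rowSums(t(t(X) - x_new)^2))  # Euclidean distances to x_new
  mean(y[order(d)[1:k]])                 # average response of the k nearest
}

X <- as.matrix(trees[, c("Girth", "Height")])
knn_reg(X, trees$Volume, c(12, 75), k = 5)
```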